Abstract

This paper describes a framework for analysing matches in multiple data sets. The framework described is quite 
general and can be applied to a variety of problems where matches are to be found in data surveyed at a number 
of locations (or at a single location over a number of days). As an example, the framework is applied to the 
problem of false matches in licence plate survey data. The specific problem addressed is that of estimating how 
many vehicles were genuinely sighted at every one of a number of survey points when there is a possibility of 
accidentally confusing two vehicles due to the nature of the survey undertaken. 

In this paper, a method for representing the possible types of match is outlined using set theory. The phrase 
types of match will be defined and formalised in this paper. A method for enumerating $\mathcal{M}_n$, the set of all types of
match over $n$ survey sites, is described. The method is applied to the problem of correcting survey data for false
matches using a simple probabilistic method. An algorithm is developed for correcting false matches over multiple
survey sites and its use is demonstrated with simulation results. 
1. Introduction

In the analysis of roadside survey data, it is of- 
ten desirable to analyse matches between several 
data sets simultaneously. For example, we might 
wish to answer questions of the general type "How 
many drivers are seen at point A, point B and point 
C?" or "How many vehicles are seen on all five sur- 
vey days?" This paper attempts to create a general 
framework for the analysis of matching between 
data from more than two surveys. The framework 
is then applied to the specific case of false matching
in partial licence plate surveys (that is, non-matches
which are mistaken for matches because
only part of the licence plate is observed). It should
be stressed throughout that the framework out- 
lined is applicable to any data series where matches 
are sought between two or more distinct data sets. 
While the work is placed in the context of licence 
plate surveys (and further in the context of licence 
plate surveys using a specific type of British licence 
plate) the results are much more general than this. 

Licence plate surveys are commonly used in the 
study of traffic systems, particularly when measurements
of the same vehicle are required at more
than one point (for example, calculating travel
times or the routes of vehicles). Although auto- 
mated techniques are becoming more common 
(GPS, toll-tags and automated recognition cam- 
eras) the manual licence plate survey remains an 
important tool for the road transport engineer. If 
a road with a high volume of traffic is being sur- 
veyed then it is often the case that only part of the 
licence plate is recorded. When this is the case, 
the possibility of spurious matches occurs. To take 
an example, standard British licence plates used 
to be of the following form: single letter, three digits,
three letters, e.g. A123BCD. This form will be
used throughout the paper; however, it must be
stressed that this method would work with partial 
observations of any type given the assumptions 
stated later. If a surveyor only recorded the first 
letter and three digits, then a vehicle A123ABC 
would not be distinguished from a vehicle A123XYZ 
since the disambiguating information (the final 
three letters) would not be recorded. 

While the chances of such a false match are low, 
quite often the combinatorics of the problem means 
that the actual recorded number of false matches 
remains high. To mathematicians, this is familiar 
as the celebrated Birthday Paradox. The Birthday 
Paradox asks the question "How many people must 
we have in a room before we might expect that two 
share the same birthday?" Intuitively, we might 
expect this to be quite a high number (since it 
is unlikely that any two people share a birthday). 
However, the number of pairs of people in a room
goes up with the square of the number of people in
the room, as $(n^2 - n)/2$. If we made the assumption
that the chance of two randomly selected people 
sharing a birthday is one in 365 then we only need 
twenty three people in the room before it becomes 
likely (probability above 50%) that two will share a 
birthday. Combinations in multiple point surveys 
work similarly. If we had two survey sites, each 
with one thousand observations then this is one 
million pairs of observations. If the chances of a
false match in a given pair are only one in ten
thousand, we will still get (on average) one hundred
false matches. This could well be larger than the 
actual number of genuine matches in the data set 
and will certainly be a significant bias. 
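
The arithmetic in this paragraph can be checked with a short sketch. It assumes the usual simplification that all 365 birthdays are equally likely; the function names are illustrative and do not come from the paper.

```python
from math import prod


def birthday_collision_probability(people: int, days: int = 365) -> float:
    """Probability that at least two of `people` share a birthday,
    assuming birthdays are uniform and independent (a simplification)."""
    no_collision = prod((days - k) / days for k in range(people))
    return 1.0 - no_collision


def expected_false_matches(n_site_a: int, n_site_b: int, p_false: float) -> float:
    """Expected number of false matches over all cross-site pairs."""
    return n_site_a * n_site_b * p_false


# Smallest group size for which a shared birthday is more likely than not.
smallest_group = next(
    k for k in range(1, 366) if birthday_collision_probability(k) > 0.5
)
print(smallest_group)  # 23

# Two sites of one thousand observations, false match chance of one in ten thousand.
print(expected_false_matches(1000, 1000, 1e-4))  # about 100
```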

This paper attempts to provide a sound theoretic
backing (using the well-known framework of
set theory) to matching problems across multiple 
data sites. In section two, a general background of 
matching problems in licence plate data is given to 
put the problem into context within the transport 
field. In section three, the concept of types of match 
is formalised using the standard set theoretic concept
of an equivalence class. In section four, a simple
method is given for constructing the set $\mathcal{M}_n$,
the set of every possible type of match across n 
survey sites. In section five, partial ordering is in- 
troduced to apply the problem to false matches 
due to incomplete observations. In section six, an 
algorithm is given for correcting false matches us- 
ing the framework developed in sections three, four 
and five. Finally, in section seven, computational 
results are given on artificially generated survey 
data. The work in this paper can be found in a
much expanded form in ([1], Chapter Four) and an
example of the method being used on real road
traffic data is found in ([1], Chapter Five). The set
theory used in this paper is extremely simple (just
the concepts of equivalence class and partial order
are necessary) and would be covered in any standard
text on the subject, for example ([2]).



2. The false match problem in licence plate 
data at multiple sites 

Throughout this paper, the examples are given 
using an old form of British licence plate — it 
should be stressed that this is not necessary for this 
framework and is done purely for the sake of ex- 
ample. The work described here assumes nothing 
about the nature of the individuals being observed 
other than the restrictions described in Definition 
20. Similarly, when the phrase observation sites is 
used throughout this paper, this can mean either 
geographically distinct observation sites or a sin- 
gle geographical location observed for a number of 
days or any combination of times and locations. 
(In the work which motivated this research, the 
experimenters were interested in finding vehicles 
which travelled between three distinct geographi- 
cal locations on two consecutive days. This would 
count as six observation sites in the terminology 
used here.) Note that no time information is used
here although time information is often available 
for such surveys. It is hoped that a future improve- 
ment to this method will make use of time informa- 
tion about observations to reduce uncertainties. 

It is often the case that on-street traffic sur- 
veys collect partial vehicle licence plate informa- 
tion. [The reason for collecting partial rather than 
full licence plate information is that the recording 
and transcription of the data is often done manu- 
ally and time constraints would preclude recording 
a full plate.] This information can then be used to 
reconstruct travel times and to infer route infor- 
mation about drivers. In partial plate data, how- 
ever, problems can occur from false matches as dis- 
cussed above. Of course, false matches could also 
occur through recording or transcription errors. 
While this paper will not discuss these problems, 
it is in principle possible to extend this framework 
to cover recording and transcription errors. 

In the case of two survey sites and no recording 
or transcription errors the situation is relatively 
clear. If our data shows that a match occurs between
two observations (one from each site) then
this must mean that either the same vehicle has
been observed at both, or that two different vehi- 
cles have been observed which happened to have 
the same partial licence plate. At multiple sites 
the situation is much more complex. An apparent 
match at four survey points may be any of the fol- 
lowing: a true match (the same vehicle seen at all 
four points); a different vehicle at each of the four 
points which (by coincidence) have the same par- 
tial plate; a vehicle at survey point one and two 
which has the same partial plate as a second ve- 
hicle at survey points three and four; or any other 
of fifteen total possibilities. The problem becomes 
more difficult as the number of sites increases. In- 
deed it is not immediately clear how to enumer- 
ate the number of ways in which a match as de- 
scribed above can occur over multiple data sites. 
This issue is not a trivial one. In real licence sur- 
veys, the number of false matches is often greater 
than the number of true matches. In ([1], Chapter
5) two survey sites with a flow of approximately 
one thousand vehicles at each were found to have 
ninety observed matches between vehicles despite 
the fact that (given the positioning of the sites) it
would be extremely unlikely for any drivers at all
to travel between them. 

A number of researchers have approached the 
false matching problem for licence plates. An early 
approach for two sites is given by ([3]), which uses
a simple probabilistic correction. Several methods
are described in ([4]), including the possibility of two
point matches between vehicles observed at pairs
of sites selected from several survey sites (for example,
entering and leaving a cross-roads). A graphical
procedure for visualising matches based upon journey
time between two sites is given by ([5]). The methods
in that paper are useful for any analysis of data in
which time between observations is a factor. Further
refinements for site pairs, including a maximum
likelihood method based upon assumptions
about travel time distribution, are given in ([6]) and
([7]). However, all of these methods concentrate on
matches between pairs of sites and the majority of 
them also assume that journey time information 
can be used to aid in finding false matches, which 
is not the case if, for example, we are interested 
in correcting false matches at the same site over 
different days. The method described in this pa- 
per concentrates on matches between observations 
at more than two sites, particularly where journey 
time information is not available or cannot be used. 

It should be emphasised again that, while this 
work is presented within the context of licence 
plate surveys (indeed within the context of licence 
plate surveys on a specific type of British licence 
plate) the results presented are extremely general. 
These results would be applicable to any type of 
survey data where individuals are sought in more 
than two data sets and where a possibility of con- 
fusion between observations of individuals exists. 
Applications for this technique are being sought 
in other areas such as DNA matching and sugges- 
tions for suitable data sets would be welcomed by 
the author. 



3. Equivalence classes for representing 
types of match 

In this section, notation is given, with examples, 
to describe a mathematical framework for investigating
matches in multiple data sets. For the convenience
of the reader the notation used throughout
this paper is gathered here for reference and
defined as it occurs throughout the paper. In general,
bold lower case $\mathbf{x}$ is used to indicate a tuple
(ordered set). Upper case $M$ is used to indicate a
set and bold upper case $\mathbf{M}$ is used to indicate a set
of sets. Calligraphic lettering $\mathcal{S}$ is used to indicate
higher order entities such as sets of tuples or sets
of sets of sets.

The following specific notation is used. 

- $n$: the number of sites under investigation.

- $\#M$: the number of members of set $M$.

- $S_i$: the set of observations at site $i$. See Definition 1.

- $\mathbf{y}$: a tuple of observations, one from each site. See Definition 2.

- $\mathcal{S}$: the set of all possible tuples of observations. See Definition 3.

- $\mathbf{M} = \{M_1, M_2, \dots, M_m\}$: a type of match. See Definition 5.

- $\mathcal{M}_n$: the set of all types of match for $n$ sites. See Definition 6.

- $C(\mathbf{y})$: the type of match of a tuple of observations $\mathbf{y}$. See Definition 7.

- $\mathbf{A}_n$: the set of sets $\{\{1, 2, \dots, n\}\}$ representing the same observation across all sites. See Definition 9.

- $\mathbf{y}^*$: the tuple of partial observations from the tuple $\mathbf{y}$. See Definition 11.

- $\mathcal{S}^*$: the set of all such partial observations. See Definition 11.

- $x(\mathbf{y}, \mathbf{M})$: the exact matching function for the tuple $\mathbf{y}$. See Definition 13.

- $X(\mathbf{M})$: the exact matching count for the set $\mathcal{S}$. See Definition 14.

- $r(\mathbf{y}, \mathbf{M})$: the relaxed matching function for the tuple $\mathbf{y}$. See Definition 15.

- $R(\mathbf{M})$: the relaxed matching count for the set $\mathcal{S}$. See Definition 16.

- $T(M)$: the number of observations which are the same across all sites in the set $M$. See Definition 18.

- $p(i)$: the probability that $i$ distinct individuals, different in a full observation, are the same in a partial observation. See Definition 20.

Definition 1 Let $n$ be the number of observation
sites and let $S_i$ be the set of observations at the $i$th
such site.

Consider the following toy example with three sites
($n = 3$),

$S_1 = \{\mathrm{A123XYZ}, \mathrm{C789ABC}\}$

$S_2 = \{\mathrm{A123XYZ}, \mathrm{A123XDR}, \mathrm{D555SDD}\}$

$S_3 = \{\mathrm{C789ABC}, \mathrm{A123XYZ}\}.$

In passing, it should be noted that a formal require- 
ment for something to be a set is that its members 
are distinct. If this formal requirement is not met 
then each member of the set could be tagged by 
a unique number which is not considered in later 
equality relations. This is a technicality which will 
not be mentioned again and does not affect what 
follows. 

Definition 2 A tuple of observations $\mathbf{y} = (y_1, \dots, y_n)$
is an $n$-tuple consisting of one member
of each set of observations, that is, $y_i \in S_i$ for
all $i$.

Continuing the previous example,

$\mathbf{y} = (\mathrm{A123XYZ}, \mathrm{A123XYZ}, \mathrm{C789ABC})$

is the tuple formed by taking the first observation
from each set.

Definition 3 The set of all tuples of observations
$\mathcal{S}$ in the data is the set of all such $\mathbf{y}$ which can be
formed from the sets $S_1, \dots, S_n$. This is clearly the
cartesian product given by

$$\mathcal{S} = S_1 \times S_2 \times \cdots \times S_n.$$

So, in the example given before, the set $\mathcal{S}$
has twelve members and is given by

$$\mathcal{S} = \{(\mathrm{A123XYZ}, \mathrm{A123XYZ}, \mathrm{C789ABC}),\ (\mathrm{A123XYZ}, \mathrm{A123XYZ}, \mathrm{A123XYZ}),\ \dots,\ (\mathrm{C789ABC}, \mathrm{D555SDD}, \mathrm{A123XYZ})\}.$$
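
As a rough illustration, the cartesian product for the toy example can be built directly with the Python standard library. The variable names below are illustrative only, and lists are used in place of sets so that the listing order matches the text.

```python
from itertools import product

# Toy observation sets from the example (one list per site).
S1 = ["A123XYZ", "C789ABC"]
S2 = ["A123XYZ", "A123XDR", "D555SDD"]
S3 = ["C789ABC", "A123XYZ"]

# Every 3-tuple with one observation from each site.
all_tuples = list(product(S1, S2, S3))

print(len(all_tuples))  # 12
print(all_tuples[0])    # ('A123XYZ', 'A123XYZ', 'C789ABC')
print(all_tuples[-1])   # ('C789ABC', 'D555SDD', 'A123XYZ')
```

The size of the product grows as the product of the site sizes, which is why later sections avoid enumerating it for realistic surveys.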

Considering the members of $\mathcal{S}$, it is obvious that

$(\mathrm{A123XYZ}, \mathrm{A123XYZ}, \mathrm{A123XYZ})$

is the type of observation which is most of interest:
the same individual observed across all three sites
under investigation. Also, the tuples

$(\mathrm{A123XYZ}, \mathrm{A123XDR}, \mathrm{A123XYZ})$

and

$(\mathrm{C789ABC}, \mathrm{A123XDR}, \mathrm{C789ABC})$

are in some way structurally similar (they match
at sites one and three) and both are structurally
different to

$(\mathrm{A123XYZ}, \mathrm{A123XYZ}, \mathrm{C789ABC}).$

This structural similarity will now be formalised
by using the concept of a type of match.
Definition 4 Two $n$-tuples of observations $\mathbf{y} = (y_1, \dots, y_n)$
and $\mathbf{z} = (z_1, \dots, z_n)$ are the same type
of match ($\mathbf{y} \sim \mathbf{z}$) if whenever two elements of $\mathbf{y}$
match then the same two elements of $\mathbf{z}$ match and
vice versa. Formally,

$$\mathbf{y} \sim \mathbf{z} \text{ if and only if } (y_i = y_j) \Leftrightarrow (z_i = z_j)$$

for all $i, j \in \{1, 2, \dots, n\}$.

Note that, for simplicity, the limits $i, j \in \{1, 2, \dots, n\}$
on indices will usually be omitted where, as in this
case, they are obvious. It can trivially be shown
that the relation defined by $\sim$ meets the requirements
of an equivalence relation in set theory.
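
The relation in Definition 4 can be tested pair by pair. The following is a minimal sketch; the function name `same_type_of_match` is illustrative and not taken from the paper.

```python
from itertools import combinations


def same_type_of_match(y, z):
    """True if y ~ z: positions i, j match in y exactly when they match in z."""
    if len(y) != len(z):
        raise ValueError("tuples must cover the same number of sites")
    return all(
        (y[i] == y[j]) == (z[i] == z[j])
        for i, j in combinations(range(len(y)), 2)
    )


# Both tuples match at sites one and three only.
print(same_type_of_match(
    ("A123XYZ", "A123XDR", "A123XYZ"),
    ("C789ABC", "A123XDR", "C789ABC"),
))  # True

# The second tuple matches at sites one and two instead.
print(same_type_of_match(
    ("A123XYZ", "A123XDR", "A123XYZ"),
    ("A123XYZ", "A123XYZ", "C789ABC"),
))  # False
```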



4. The set of every type of match 

Having formalised the concept of when two tuples
of observations are the same type of match, the
next step is to introduce an entity which can represent
the type of match of a given tuple of observations.
This is simply achieved using partitions of
the first $n$ integers. A partition of the first $n$ integers
is a set of sets $\mathbf{M} = \{M_1, M_2, \dots, M_m\}$ such
that each integer from one to $n$ is in one and only
one of the sets $M_1, \dots, M_m$. (In the literature, these
$M_i$ are often referred to as blocks.) Any $n$-tuple of
observations is related to some such $\mathbf{M}$ by the relation
given in Definition 7.

Definition 5 A type of match is a partition $\mathbf{M}$ of
the first $n$ integers which is used to represent the
structure of matches within an $n$-tuple of observations
$\mathbf{y}$. The relationship between $\mathbf{M}$ and $\mathbf{y}$ is given
by Definition 7.

Considering the first three integers, then $\{\{1,2,3\}\}$,
$\{\{1,2\},\{3\}\}$ and $\{\{1\},\{2\},\{3\}\}$ are among the
possible partitions.

Definition 6 The set $\mathcal{M}_n$ is the set of all possible
partitions of the first $n$ integers. This can be used
to represent any possible type of match over $n$ observation
sites.

For one site only the partition $\{\{1\}\}$ is in $\mathcal{M}_1$.
For two sites, two possible partitions are available:
$\{\{1,2\}\}$ and $\{\{1\},\{2\}\}$. For three sites, five partitions
are available. The enumeration of $\#\mathcal{M}_n$ is
well understood and uses the Bell numbers ([8]). The
sequence of the Bell numbers begins 1, 2, 5, 15, 52,
203, 877, 4140, 21147.
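
One way to enumerate the partitions of the first $n$ integers is sketched below. It uses a standard recursive construction (place $n$ into each existing block in turn, or into a new block of its own); this is an assumption for illustration and is not necessarily the construction the author uses.

```python
def set_partitions(n):
    """Yield every partition of {1, ..., n} as a list of blocks (lists)."""
    if n == 0:
        yield []
        return
    for smaller in set_partitions(n - 1):
        # Place n into each existing block in turn.
        for idx in range(len(smaller)):
            yield smaller[:idx] + [smaller[idx] + [n]] + smaller[idx + 1:]
        # Or give n a block of its own.
        yield smaller + [[n]]


counts = [sum(1 for _ in set_partitions(n)) for n in range(1, 8)]
print(counts)  # [1, 2, 5, 15, 52, 203, 877]

for partition in set_partitions(3):
    print(partition)
```

The counts reproduce the start of the Bell number sequence, including the fifteen possibilities for four sites.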

Definition 7 The type of match of an $n$-tuple of
observations $\mathbf{y} = (y_1, \dots, y_n)$ is given by $C(\mathbf{y}) \in \mathcal{M}_n$
where $C(\mathbf{y}) = \mathbf{M} = \{M_1, \dots, M_m\}$ is the partition
of the first $n$ integers which satisfies
$(y_i = y_j) \Leftrightarrow i, j \in M_k$ for some $k \in \{1, 2, \dots, m\}$. That
is, $\mathbf{M}$ is the partition chosen such that any two site
indices are in the same block within $\mathbf{M}$ if and only
if the observations in $\mathbf{y}$ at those sites are equal.

It can clearly be seen that $C(\mathbf{y})$ is uniquely specified
by this definition. To continue with the earlier
example, if

$\mathbf{y} = (\mathrm{A123XYZ}, \mathrm{A123XYZ}, \mathrm{C789ABC})$ then
$C(\mathbf{y}) = \{\{1,2\},\{3\}\}$

and if

$\mathbf{y} = (\mathrm{A123XYZ}, \mathrm{A123XYZ}, \mathrm{A123XYZ})$ then
$C(\mathbf{y}) = \{\{1,2,3\}\}.$
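
A minimal sketch of the map $C$ groups site indices by the observation seen at each site. The function name `type_of_match` and the use of frozensets to represent a partition are illustrative choices, not part of the paper.

```python
def type_of_match(y):
    """Return C(y): the partition of site indices 1..n in which two sites
    share a block exactly when their observations are equal."""
    blocks = {}
    for site, observation in enumerate(y, start=1):
        blocks.setdefault(observation, set()).add(site)
    return frozenset(frozenset(block) for block in blocks.values())


print(type_of_match(("A123XYZ", "A123XYZ", "C789ABC")))
# the partition {{1, 2}, {3}}

print(type_of_match(("A123XYZ", "A123XYZ", "A123XYZ")))
# the partition {{1, 2, 3}}
```

Because partitions are stored as frozensets, two tuples have the same type of match exactly when their `type_of_match` values compare equal.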

It must now be shown that $C(\mathbf{y})$ works as a representation
of the type of match in a consistent way
with the relationship $\sim$ given by Definition 4.

Theorem 8 For $n$-tuples of observations $\mathbf{y} = (y_1, \dots, y_n)$
and $\mathbf{z} = (z_1, \dots, z_n)$ then

$$C(\mathbf{y}) = C(\mathbf{z}) \text{ if and only if } \mathbf{y} \sim \mathbf{z}.$$

PROOF. Let $\mathbf{M}_y = C(\mathbf{y})$ and $\mathbf{M}_z = C(\mathbf{z})$. First
it must be shown that $(\mathbf{y} \sim \mathbf{z}) \Rightarrow (\mathbf{M}_y = \mathbf{M}_z)$.
This follows trivially. Since $(y_i = y_j) \Leftrightarrow (z_i = z_j)$
then if $i, j$ are in the same set in $\mathbf{M}_y$ they must be
in the same set in $\mathbf{M}_z$ and if they are in different
sets in $\mathbf{M}_y$ they must be in different sets in $\mathbf{M}_z$. As
all integers from one to $n$ appear once each in both
partitions then it must be the case that $\mathbf{M}_y = \mathbf{M}_z$.

Similarly it must be shown that $(\mathbf{M}_y = \mathbf{M}_z) \Rightarrow (\mathbf{y} \sim \mathbf{z})$.
A very similar argument applies. If $i, j$ are
in the same set in $\mathbf{M}_y$ (and therefore in $\mathbf{M}_z$) then
$y_i = y_j$ and also $z_i = z_j$; if they are in different
sets then $y_i \neq y_j$ and also $z_i \neq z_j$. Therefore
$(y_i = y_j) \Leftrightarrow (z_i = z_j)$ and hence $\mathbf{y} \sim \mathbf{z}$.

It is useful at this point to define a shorthand no- 
tation for the type of match of most interest, that 
where the observations are the same at every site. 
Definition 9 Let $\mathbf{A}_n \in \mathcal{M}_n$ represent a true
match, that is, the type of match where the same
observation is made over all $n$ sites. Therefore,
$\mathbf{A}_n = \{\{1, 2, \dots, n\}\}$.

5. Introducing false matching into the 
framework 

So far the false match problem has been ignored 
and it has been assumed that for a given $n$-tuple
of observations $\mathbf{y} = (y_1, \dots, y_n)$ the relation
$y_i = y_j$ can be taken at face value. However, the
original problem was that, in licence plates, par- 
tial observations can lead to two distinct individu- 
als being confused. In order to capture this in the 
described framework, a partial ordering will be introduced
on the set $\mathcal{M}_n$ and this will then be related
to the partial observations. (It is somewhat
unfortunate that this paper uses the phrase "par- 
tial plate survey" from transportation and the term 
"partial ordering" from set theory. These terms 
should not be confused.) 

The next step is to introduce a partial ordering 
on the set $\mathcal{M}_n$. It will be seen in the next section
how this relates to the false matching problem. 
Definition 10 For two partitions

$$\mathbf{M} = \{M_1, \dots, M_m\} \in \mathcal{M}_n$$

and

$$\mathbf{M}' = \{M'_1, \dots, M'_{m'}\} \in \mathcal{M}_n$$

a partial ordering $\succeq$ is given by,

$$\mathbf{M} \succeq \mathbf{M}' \text{ if and only if } (i, j \in M_k) \Rightarrow (i, j \in M'_l),$$

for some $k$ and $l$. Put more simply, $\mathbf{M} \succeq \mathbf{M}'$ if
whenever $i$ and $j$ are in the same set within $\mathbf{M}$ then
they are also in the same set within $\mathbf{M}'$.

The symbol $\succ$ will be used to mean strictly succeeds.
That is, $x \succ y$ means $x \succeq y$ and $x \neq y$. The
symbol $\succ\succ$ will be used to mean immediate successor,
that is, if $x \succ\succ z$ then $x \succ z$ but there is no $y$
such that $x \succ y \succ z$. The symbols $\prec$, $\preceq$ and $\prec\prec$
will have their obvious meanings.
It can be trivially shown that this relation meets
the formal requirements for a partial ordering. It
should also be noted that this relation is extremely
close to the original equivalence relation but with
the implication going in one direction only. It can
also be shown that under this partial ordering
$\#\mathbf{M}$, the number of sets (blocks) in $\mathbf{M} \in \mathcal{M}_n$, is a
consistent enumeration of $\mathcal{M}_n$.

A Hasse diagram is a way of visualising a partially
ordered set. A Hasse diagram is constructed
by plotting a partially ordered set $S$ graphically
in such a way that for all $x, y \in S$ if $x \prec y$ then
$x$ is further to the bottom of the diagram than $y$.
An arrow is drawn in a Hasse diagram from $x$ to
$y$ if $x \succ\succ y$. Figure 1 shows the Hasse diagram of
$\mathcal{M}_4$ with the partial ordering given by the previous
definition.
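
The ordering of Definition 10 can be sketched as a block containment test: for two partitions of the same integers, "every pair sharing a block in $\mathbf{M}$ also shares a block in $\mathbf{M}'$" is equivalent to every block of $\mathbf{M}$ lying inside some block of $\mathbf{M}'$. The function name and the frozenset representation are illustrative assumptions.

```python
def succeeds_or_equals(M, M_prime):
    """True if M succeeds or equals M' in the sense of Definition 10:
    every block of M is contained in some block of M'."""
    return all(any(block <= other for other in M_prime) for block in M)


def partition(*blocks):
    """Small helper: build a partition from iterables of site indices."""
    return frozenset(frozenset(block) for block in blocks)


A4 = partition({1, 2, 3, 4})              # true match over four sites
all_distinct = partition({1}, {2}, {3}, {4})
left = partition({1, 2}, {3}, {4})
right = partition({1}, {2}, {3, 4})

print(succeeds_or_equals(all_distinct, A4))  # True: A4 lies below everything
print(succeeds_or_equals(A4, all_distinct))  # False
print(succeeds_or_equals(left, right))       # False: the two are incomparable
print(succeeds_or_equals(right, left))       # False
```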

Definition 11 Given an $n$-tuple of observations
$\mathbf{y} = (y_1, \dots, y_n)$, let $\mathbf{y}^* = (y^*_1, \dots, y^*_n)$ represent
the partial observation formed from $\mathbf{y}$. Since a partial
observation can cause distinct individuals to
appear the same but cannot cause the same individual
to appear distinct at different sites then the
following relation holds,

$$(y_i = y_j) \Rightarrow (y^*_i = y^*_j).$$

This star notation will also be used to distinguish
the set of all possible partial observations in the data
$\mathcal{S}^*$ and, in general, to distinguish functions which
apply to partial data rather than the full data.

Note that this is the only assumption so far made
about the nature of the partial observation. In licence
plate surveys, the choice of which
part of a plate to survey needs to be made with
reference to the particular format of plate to be
observed. Consider the observations from the earlier
example. If $\mathbf{y} = (\mathrm{A123XYZ}, \mathrm{A123XDR}, \mathrm{C789ABC})$
then a standard way to make partial observations
on this type of plate is to collect only the first letter
and the digits. Therefore $\mathbf{y}^* = (\mathrm{A123}, \mathrm{A123}, \mathrm{C789})$.
Note that $C(\mathbf{y}) \neq C(\mathbf{y}^*)$ since $y_1 \neq y_2$ but $y^*_1 = y^*_2$.
The way that $C(\mathbf{y})$ can change when only a partial
observation is made is given by the next theorem.
Theorem 12 If $\mathbf{y} = (y_1, \dots, y_n)$ is an $n$-tuple of
observations then

$$C(\mathbf{y}^*) \preceq C(\mathbf{y}).$$

PROOF. Let $\mathbf{M} = \{M_1, \dots, M_m\} = C(\mathbf{y})$ and
$\mathbf{M}' = \{M'_1, \dots, M'_{m'}\} = C(\mathbf{y}^*)$. The theorem follows
trivially from the relation given in Definition
11. If $i, j \in M_k$ for some $k$ then $y_i = y_j$ and hence
$y^*_i = y^*_j$ which in turn implies $i, j \in M'_l$ for some
$l$. Therefore, $(i, j \in M_k) \Rightarrow (i, j \in M'_l)$ which is
the condition for the partial ordering.

[Figure: partitions of four sites arranged in rows by number of blocks, from $\{1\}\{2\}\{3\}\{4\}$ ($\#\mathbf{M} = 4$) at the top, through the rows with $\#\mathbf{M} = 3$ and $\#\mathbf{M} = 2$, to $\{1,2,3,4\}$ ($\#\mathbf{M} = 1$) at the bottom.]

Fig. 1. Hasse diagram for $\mathcal{M}_4$.

From this theorem, it can be seen that when 
only partial data is available, the type of match of 
the partial observation may change only in a given 
way. Specifically, the type of match of the partial 
data can be the same as that of the full data or 
any type of match available by following down the 
arrows on the Hasse diagram. 
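
The effect of a partial observation on the type of match can be illustrated with the earlier example. This sketch assumes the partial observation keeps the first four characters of the old-style plate (the first letter and the three digits); the helper names are illustrative only.

```python
def type_of_match(y):
    """C(y): sites share a block exactly when their observations are equal."""
    blocks = {}
    for site, observation in enumerate(y, start=1):
        blocks.setdefault(observation, set()).add(site)
    return frozenset(frozenset(block) for block in blocks.values())


def succeeds_or_equals(M, M_prime):
    """M succeeds or equals M': every block of M sits inside a block of M'."""
    return all(any(block <= other for other in M_prime) for block in M)


def partial_plate(plate):
    """Assumed partial observation: first letter and three digits."""
    return plate[:4]


y = ("A123XYZ", "A123XDR", "C789ABC")
y_star = tuple(partial_plate(plate) for plate in y)

print(y_star)  # ('A123', 'A123', 'C789')
print(type_of_match(y) == type_of_match(y_star))  # False
# The partial type of match lies at or below the full one in the ordering.
print(succeeds_or_equals(type_of_match(y), type_of_match(y_star)))  # True
```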

Next, some counting functions are defined - 
these are used to enumerate the number of matches 
in the data which are different types of match. 
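Definition 13 Let $\mathbf{y}$ be an $n$-tuple of observations
and $\mathbf{M} \in \mathcal{M}_n$ be a type of match. The exact matching
function for an observation $\mathbf{y}$ is defined by,

$$x(\mathbf{y}, \mathbf{M}) = \begin{cases} 1 & \text{if } C(\mathbf{y}) = \mathbf{M} \\ 0 & \text{otherwise.} \end{cases}$$

Definition 14 Let $\mathbf{M} \in \mathcal{M}_n$ be a type of match.
The exact matching function for $\mathcal{S}$, the set of all
observations, is given by,

$$X(\mathbf{M}) = \sum_{\mathbf{y} \in \mathcal{S}} x(\mathbf{y}, \mathbf{M}).$$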

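Definition 15 Let $\mathbf{y}$ be an $n$-tuple of observations
and $\mathbf{M} \in \mathcal{M}_n$ be a type of match. The relaxed
matching function for an observation is defined by,

$$r(\mathbf{y}, \mathbf{M}) = \begin{cases} 1 & \text{if } C(\mathbf{y}) \preceq \mathbf{M} \\ 0 & \text{otherwise.} \end{cases}$$

Equivalently,

$$r(\mathbf{y}, \mathbf{M}) = \sum_{\mathbf{M}' \preceq \mathbf{M}} x(\mathbf{y}, \mathbf{M}').$$

Definition 16 Let $\mathbf{M} \in \mathcal{M}_n$ be a type of match.
The relaxed matching function for $\mathcal{S}$, the set of all
observations, is given by

$$R(\mathbf{M}) = \sum_{\mathbf{y} \in \mathcal{S}} r(\mathbf{y}, \mathbf{M}).$$

Equivalently,

$$R(\mathbf{M}) = \sum_{\mathbf{M}' \preceq \mathbf{M}} X(\mathbf{M}').$$

It should be noted in passing that $R(\mathbf{A}_n) = X(\mathbf{A}_n)$
since there are no $\mathbf{M} \prec \mathbf{A}_n$.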
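
For the three-site toy example the exact and relaxed counts can be computed by brute force. The sketch below is only feasible because the example is tiny; helper names are illustrative, and the relaxed count is computed by summing exact counts over all types of match at or below the given one.

```python
from collections import Counter
from itertools import product


def type_of_match(y):
    """C(y): sites share a block exactly when their observations are equal."""
    blocks = {}
    for site, observation in enumerate(y, start=1):
        blocks.setdefault(observation, set()).add(site)
    return frozenset(frozenset(block) for block in blocks.values())


def succeeds_or_equals(M, M_prime):
    """M succeeds or equals M': every block of M sits inside a block of M'."""
    return all(any(block <= other for other in M_prime) for block in M)


S1 = ["A123XYZ", "C789ABC"]
S2 = ["A123XYZ", "A123XDR", "D555SDD"]
S3 = ["C789ABC", "A123XYZ"]

# Exact counts X(M): number of tuples whose type of match is exactly M.
X = Counter(type_of_match(y) for y in product(S1, S2, S3))


def R(M):
    """Relaxed count: sum of X(M') over all M' at or below M."""
    return sum(count for M_prime, count in X.items()
               if succeeds_or_equals(M, M_prime))


A3 = frozenset({frozenset({1, 2, 3})})
M13 = frozenset({frozenset({1, 3}), frozenset({2})})
all_distinct = frozenset({frozenset({1}), frozenset({2}), frozenset({3})})

print(X[A3], R(A3))     # 1 1
print(X[M13], R(M13))   # 5 6
print(R(all_distinct))  # 12, every tuple in the product
```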



6. Solving the false match problem 



It can be readily seen that $X(\mathbf{M})$ is the number of
$n$-tuples $\mathbf{y} \in \mathcal{S}$ which have a type of match $C(\mathbf{y}) = \mathbf{M}$.
It can be further seen that the original problem
of counting the number of individuals seen at all
of $n$ sites is the problem of evaluating $X(\mathbf{A}_n)$.



In order to solve the false match problem, it is
necessary to prove some simple lemmas which relate
these counting functions. The main goal here is
to estimate $X(\mathbf{A}_n)$ (the number of $n$-tuples representing
the same individual at all $n$ sites) in terms
of the partial data $\mathcal{S}^*$. The second goal is to do
this in a way which does not involve investigating
every single possible $n$-tuple. The reason for this is
that a realistic size for a traffic survey is of the order
of one thousand vehicles. If there are six sites,
then there are $1000^6$ tuples to investigate and this
would be far too slow computationally.
Lemma 17 Any exact matching function can be
expressed in terms of relaxed matching functions
and "lower" exact matching functions.

$$X(\mathbf{M}) = R(\mathbf{M}) - \sum_{\mathbf{M}' \prec \mathbf{M}} X(\mathbf{M}').$$

PROOF. This follows trivially from Definition
16.

This expression can be used recursively so that any
$X(\mathbf{M})$ can be expressed as a function of $R(\mathbf{M}')$ for
all $\mathbf{M}' \preceq \mathbf{M}$. The lemma can be thought of as being
a version of the inclusion/exclusion principle
for partitions of the integers under this partial ordering.

Definition 18 Let $M = \{m_1, \dots, m_l\}$ be a set of
integers, such that $m_i \in \{1, 2, \dots, n\}$ for all $i$. Let
$\mathcal{S}'$ be the set of $l$-tuples of observations formed by
the cartesian product,

$$\mathcal{S}' = S_{m_1} \times S_{m_2} \times \cdots \times S_{m_l}.$$

In other words, $\mathcal{S}'$ is the set of $l$-tuples of observations
over some subset of the original sites. Then
define,

$$T(M) = X(\mathbf{A}_l),$$

where the exact match $X(\mathbf{A}_l)$ is in this case over the
$l$-tuples in $\mathcal{S}'$ rather than the $n$-tuples in $\mathcal{S}$. In other
words, $T(M)$ is the number of individuals seen at
all sites in the set $M$.

It can be easily seen that the problem
of evaluating $T(M)$ is either exactly the same as
the original problem, if $M = \{1, 2, \dots, n\}$, or it is
a subproblem over a reduced number of sites. If
$M$ has a single member, $M = \{m\}$, then $T(M)$ is
simply the number of observations in set $S_m$, that
is, $T(\{m\}) = \#S_m$.

Lemma 19 The relaxed matching function $R(\mathbf{M})$
where $\mathbf{M} = \{M_1, \dots, M_m\} \in \mathcal{M}_n$ can be expressed
as a product of exact matches over subsets of sites
using the expression,

$$R(\mathbf{M}) = \prod_{i=1}^{m} T(M_i).$$

PROOF. Clearly, for an $n$-tuple of observations
$\mathbf{y} = (y_1, \dots, y_n)$ then,

$$r(\mathbf{y}, \mathbf{M}) = \begin{cases} 1 & \text{if for all } i, j, k \text{ then } (i, j \in M_k) \Rightarrow (y_i = y_j) \\ 0 & \text{otherwise.} \end{cases}$$

Therefore,

$$\sum_{\mathbf{y} \in \mathcal{S}} r(\mathbf{y}, \mathbf{M}) = \#\{\mathbf{y} \in \mathcal{S} : (i, j \in M_k) \Rightarrow (y_i = y_j) \text{ for all } i, j, k\}.$$

The left hand side of this is simply $R(\mathbf{M})$ as required.
Since $\mathcal{S}$ is the cartesian product then it can
be seen that those $\mathbf{y} \in \mathcal{S}$ which meet the condition
are those which are picked out by $T(M_i)$ and therefore
the right hand side is $\prod_{i=1}^{m} T(M_i)$ as required.

Note that if $\mathbf{M} = \mathbf{A}_n$ then this expression simply
says $R(\mathbf{A}_n) = T(\{1, 2, \dots, n\}) = X(\mathbf{A}_n)$. In all
other cases, this allows a relaxed matching function
to be expressed as a product of exact matching
functions over a subset of the original sites.
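
The product form of Lemma 19 can be checked on the toy example. In this sketch $T(M)$ is computed as the size of the intersection of the relevant sites, which agrees with Definition 18 provided each site holds distinct observations (an assumption made here); the brute-force count is included only for comparison, and all names are illustrative.

```python
from itertools import product
from math import prod

# Toy data: site index -> set of (distinct) observations.
sites = {
    1: {"A123XYZ", "C789ABC"},
    2: {"A123XYZ", "A123XDR", "D555SDD"},
    3: {"C789ABC", "A123XYZ"},
}


def T(M):
    """Number of observations common to every site in M."""
    return len(set.intersection(*(sites[m] for m in M)))


def R_product(partition):
    """Relaxed count via the product of T over the blocks."""
    return prod(T(block) for block in partition)


def R_brute(partition):
    """Relaxed count by direct enumeration of the cartesian product."""
    ordered = [sorted(sites[i]) for i in sorted(sites)]
    return sum(
        1
        for y in product(*ordered)
        if all(len({y[i - 1] for i in block}) == 1 for block in partition)
    )


partitions = [
    [[1, 2, 3]],
    [[1, 2], [3]],
    [[1, 3], [2]],
    [[1], [2, 3]],
    [[1], [2], [3]],
]
for partition in partitions:
    print(partition, R_product(partition), R_brute(partition))
```

The product form needs only intersections of site sets, so it avoids enumerating the full cartesian product.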
Definition 20 The probability $p(i)$ where $i \in \{1, 2, \dots, n\}$
is the probability that, given that $i$
observed individuals are all different in the full
observation, they will all be the same in the partial
observation. For the method described to work, this
$p(i)$ must be independent of the sites at which the
vehicles are observed. By convention, $p(1) = 1$.
It should be noted that this definition does place 
some restrictions on the type of data which can be 
analysed by this method and which types of partial 
observations are suitable. A discussion of p(i) in 
the context of licence plate observations follows 
this section. It is likely that other formulations of 
this problem would be possible if p(i) varies with 
the sites considered. 

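Lemma 21 An unbiased estimator $t$ for $X(\mathbf{A}_n)$ is
given by,

$$t = X^*(\mathbf{A}_n) - \sum_{\mathbf{M} \succ \mathbf{A}_n} p(\#\mathbf{M})\, X(\mathbf{M}).$$

PROOF. The quantity $X^*(\mathbf{A}_n)$ is equal to
$X(\mathbf{A}_n)$ plus all those $n$-tuples of observations
which are false matches. Each element of the sum
represents the number of false matches arising from
a given type of match. Writing this out formally,

$$t = X^*(\mathbf{A}_n) - \sum_{\mathbf{M} \succ \mathbf{A}_n} E\left[\#\{\mathbf{y} \in \mathcal{S} : C(\mathbf{y}^*) = \mathbf{A}_n, C(\mathbf{y}) = \mathbf{M}\}\right].$$

The set $\{\mathbf{y} \in \mathcal{S} : C(\mathbf{y}^*) = \mathbf{A}_n, C(\mathbf{y}) = \mathbf{M}\}$ is the
set of $n$-tuples in the data $\mathcal{S}$ which are a match of
type $\mathbf{M}$ in the complete data but appear to be a
match of type $\mathbf{A}_n$ in the partial data $\mathcal{S}^*$. Now, the
number of distinct individuals in this $n$-tuple must
be $\#\mathbf{M}$. Therefore,

$$P\left[C(\mathbf{y}^*) = \mathbf{A}_n \mid C(\mathbf{y}) = \mathbf{M}\right] = p(\#\mathbf{M}).$$

Bayes theorem gives,

$$P\left[C(\mathbf{y}^*) = \mathbf{A}_n, C(\mathbf{y}) = \mathbf{M}\right] = p(\#\mathbf{M})\, P\left[C(\mathbf{y}) = \mathbf{M}\right] = p(\#\mathbf{M})\, X(\mathbf{M}) / \#\mathcal{S}.$$

Hence, the expected number of false matches arising
from each type of match can be given by,

$$E\left[\#\{\mathbf{y} \in \mathcal{S} : C(\mathbf{y}^*) = \mathbf{A}_n, C(\mathbf{y}) = \mathbf{M}\}\right] = \#\mathcal{S}\, P\left[C(\mathbf{y}^*) = \mathbf{A}_n, C(\mathbf{y}) = \mathbf{M}\right],$$

and the lemma follows immediately.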

It may not be immediately obvious that Lemmas
21, 17 and 19 together allow an unbiased estimate
of the number of true matches from the partial
plate data (assuming that the $p(i)$ are known).
First, looking at Lemma 21, the quantity $X^*(\mathbf{A}_n)$
can be simply enumerated by computer in the partial
data. Therefore, this lemma allows an unbiased
estimate of the number of matches in the complete
data if an unbiased estimate of $X(\mathbf{M})$ can
be found for all $\mathbf{M} \succ \mathbf{A}_n$. Now, Lemma 17 allows
$X(\mathbf{M})$ to be expressed as a sum of $R(\mathbf{M}')$ for all
$\mathbf{M}' \preceq \mathbf{M}$. Lemma 19 allows those $R(\mathbf{M}')$ to be either
equal to the original required quantity $X(\mathbf{A}_n)$
or to be expressed in terms of a product involving
subproblems on a reduced number of sites. Hence,
computer algebra can be used to give an equation
which is in terms of $X(\mathbf{A}_n)$ (the quantity desired),
$X^*(\mathbf{A}_n)$ (measurable on the data), $p(i)$ (assumed
to be known) and $T(M)$ (which is a subproblem
of the original problem with a reduced number of
sites). The computer can then be used to recursively
solve the subproblem which has already been
shown to be trivial for just one site. An expanded
description of this solution process is given in ([1],
Chapter 4).
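
As a worked illustration of how the three lemmas combine, consider the simplest case of two sites. The only type of match above $\mathbf{A}_2$ is $\{\{1\},\{2\}\}$, for which Lemma 19 gives $R = \#S_1 \#S_2$ and Lemma 17 gives $X = \#S_1 \#S_2 - X(\mathbf{A}_2)$. Substituting the estimate $t$ for $X(\mathbf{A}_2)$ in Lemma 21 and solving gives $t = (X^*(\mathbf{A}_2) - p(2)\,\#S_1 \#S_2)/(1 - p(2))$. The sketch below implements only this two-site special case; the function name and the input figures are hypothetical.

```python
def estimate_true_matches_two_sites(x_star, n1, n2, p2):
    """Two-site special case: solve t = x_star - p2 * (n1 * n2 - t) for t.

    x_star : matches counted in the partial data
    n1, n2 : number of observations at each site
    p2     : probability that two distinct individuals share a partial observation
    """
    if not 0.0 <= p2 < 1.0:
        raise ValueError("p2 must lie in [0, 1)")
    return (x_star - p2 * n1 * n2) / (1.0 - p2)


# Hypothetical survey: 1000 observations per site, 150 partial matches,
# and a false match probability of one in ten thousand.
t = estimate_true_matches_two_sites(x_star=150, n1=1000, n2=1000, p2=1e-4)
print(round(t, 3))  # about 50 true matches once roughly 100 false ones are removed
```

For more than two sites the same substitution has to be carried out over every type of match above $\mathbf{A}_n$, which is where the recursive solution described in the text is needed.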

6.1. Estimating the probability of false matches 

The method described here relies on a good estimate of p(i) and also
on the assumption that this does not vary by the sites chosen. The
specific details of British licence plates are not of general
interest (and it should again be stressed that the method discussed
here is general and not limited to specific types of licence plate
survey; indeed, it could be used for any type of data collection
where the restrictions on p(i) are met). However, illustrating how
p(i) can be estimated in a practical case might be of interest and
illuminate how the method was applied in real life. More details on
this can be found in ([1], Chapter 5).

Two methods of estimating p(i) are practical. If the distribution of
the vehicle types can be calculated then an analytical approach is
possible. Let there be N vehicle types which are distinguishable in
the partial observations and let $f_j$ be the proportion of the
vehicle fleet which is of type j (assume the membership of each type
is relatively large). Then p(i), the probability that i distinct
vehicles all give the same partial observation, is approximately
given by,

$$p(i) \approx \sum_{j=1}^{N} f_j^{\,i}.$$

In the case of the old style British licence plates discussed, the
distribution of the digits is almost flat from 1 to 999. The
distribution of the year letters is more complex and can be estimated
from consideration of the data. Therefore, $f_j$ can be calculated
for each possible partial observation, and hence p(i).
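The analytical estimate can be illustrated with a minimal sketch. It assumes the approximation $p(i) \approx \sum_j f_j^{\,i}$ (the chance that i vehicles drawn independently from the fleet fall in the same distinguishable type); the function name and the example fleets are hypothetical and chosen only for illustration.

```python
def false_match_probability(type_proportions, i):
    """Approximate p(i), assuming p(i) ~ sum over types j of f_j ** i.

    type_proportions holds the f_j: the share of the fleet in each
    type that is distinguishable in the partial observations.
    """
    if abs(sum(type_proportions) - 1.0) > 1e-9:
        raise ValueError("the proportions f_j must sum to one")
    return sum(f ** i for f in type_proportions)


# A flat distribution over 10,000 types (the setting of the simulations).
flat_fleet = [1.0 / 10_000] * 10_000
p2_flat = false_match_probability(flat_fleet, 2)
p3_flat = false_match_probability(flat_fleet, 3)

# A hypothetical skewed fleet: uneven types make false matches more likely.
skewed_fleet = [0.5, 0.3, 0.2]
p2_skewed = false_match_probability(skewed_fleet, 2)

print(p2_flat, p3_flat, p2_skewed)
```

For a flat distribution over N types the sum reduces to N^(1-i), so the flat fleet above gives p(2) close to 10^-4 and p(3) close to 10^-8.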

An alternative method is to estimate p(2) by finding two sites which
are so far separated geographically that no vehicle could be seen at
both. Any vehicle apparently seen at both must be a false match and
therefore, if there are x observed matches in the partial data, then
$p(2) = x/(\#S_1 \#S_2)$, where $\#S_1$ and $\#S_2$ are the numbers
of observations at the two sites. Similarly, p(3) can be estimated by
finding three such geographically distant sites. Higher order p(i)
can be estimated with reference to the previous method or by assuming
a functional form for the fall off. An estimate of
$p(2) = 7.4 \times 10^{-6}$ was given in ([1], Chapter 5) for licence
plate data of the type discussed.
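The empirical estimate from two distant sites can be sketched as follows. This is only an illustration: the partial-plate strings are made up, and the helper names are hypothetical rather than taken from the work cited above.

```python
from collections import Counter


def count_pair_matches(site_a, site_b):
    """Count pairs (one observation per site) with equal partial plates."""
    counts_a = Counter(site_a)
    counts_b = Counter(site_b)
    # A Counter returns 0 for values never seen at the second site.
    return sum(n * counts_b[obs] for obs, n in counts_a.items())


def estimate_p2(site_a, site_b):
    """Estimate p(2) as x / (#S_1 * #S_2), assuming no vehicle visits both sites."""
    x = count_pair_matches(site_a, site_b)
    return x / (len(site_a) * len(site_b))


# Hypothetical partial observations at two geographically distant sites.
site_a = ["A12", "B07", "A12", "C33"]
site_b = ["A12", "D90", "C33"]
x_matches = count_pair_matches(site_a, site_b)
p2_estimate = estimate_p2(site_a, site_b)
print(x_matches, p2_estimate)
```

Counting by value rather than comparing every pair keeps the work roughly linear in the number of observations, which matters when each site holds thousands of records.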



7. Results on simulated data 

Table 1 shows simulation results for between two and six observation
sites. These could be thought of as one site observed on several
days, or as up to six sites observed on several different days. Num.
Veh. refers to the total number of observations at each of the sites
(in these simulations, there are the same number of vehicles in each
data set). The five columns of the form 1-n refer to the number of
vehicles which genuinely went from site one to site n, visiting all
sites in between. If a column is blank and no column to its right is
filled, there was no site n (a blank to the left of a filled column
simply means that no vehicles made only that shorter journey). For
example, if 1-2 = 100, 1-3 = 200 and 1-4 is blank, this means that
100 vehicles travelled between site one and site two, 200 vehicles
travelled between sites one, two and three, and there were only three
sites. Note that these are cumulative, so that if 1-2 = 20 and
1-3 = 10 then 30 vehicles in total went from site one to site two and
ten of them continued to site three. Thus the first experiment has
two sites with 1000 vehicles at each, of which ten vehicles were
genuinely seen at both sites. In every experiment, the number of
different vehicle types was set at 10,000 with a flat distribution.
Note that the simplifying assumptions of a flat distribution and the
same number of vehicles at each site are there simply to make the
experiment easier to understand, rather than being necessary for the
method to work. It should be clear that the desired answer from the
correction process is the rightmost figure in these columns.
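The simulator itself is not described in detail, so the following is only a sketch of how data with the flow structure of Table 1 might be generated. It assumes each vehicle draws its partial observation uniformly from the available types and that every site is topped up with vehicles seen nowhere else; the function and argument names are hypothetical.

```python
import random


def simulate_sites(num_sites, num_veh, flows, num_types, rng):
    """Return one list of partial observations (type ids) per site.

    flows maps n to the number of vehicles travelling from site 1 to
    site n and no further, matching the cumulative reading of the
    1-n columns.
    """
    sites = [[] for _ in range(num_sites)]
    for last_site, count in flows.items():
        for _ in range(count):
            vehicle_type = rng.randrange(num_types)
            for s in range(last_site):
                sites[s].append(vehicle_type)
    for observations in sites:
        if len(observations) > num_veh:
            raise ValueError("flows exceed the number of vehicles per site")
        # Remaining vehicles are seen at this site only.
        while len(observations) < num_veh:
            observations.append(rng.randrange(num_types))
    return sites


# Three sites, 1000 vehicles each, 1-2 = 500 and 1-3 = 10.
example_sites = simulate_sites(3, 1000, {2: 500, 3: 10}, 10_000, random.Random(1))
print([len(s) for s in example_sites])
```

Passing in a seeded `random.Random` instance makes each simulated data set reproducible, which is convenient when the same experiment is repeated many times with different seeds.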

Each experiment is repeated twenty times with simulated data being
generated anew each time. The correction process has no random
element and will always give the same result for the same data. The
mean raw number of matches is given: this is the total number of
n-tuples which were seen to have the same value for each observation
at every site (averaged over the twenty simulation runs). Because of
the combinatorial nature of the procedure, this could, in principle,
be much larger than the number of vehicles in any of the data sets
(since it counts any n-tuple). The sample standard deviation (σ) is
given for the raw matches. The mean estimated correct number of
matches is then given (again averaged over the twenty simulations).
The sample standard deviation σ is then given for the twenty
corrected matches. It is clear that the most important test is that
the mean corrected number of matches is as near to correct as
possible. However, it should also be kept in mind that, in reality, a
researcher could only run the matching procedure once on any given
set of data, so it is also important that σ is as low as possible. A
significant improvement to the method would be to estimate the
variance as well as producing the mean, in order that the researcher
could have some idea as to the likely accuracy of the corrected
results.
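The raw count of matching n-tuples and the summary statistics reported in the tables can be sketched as below. The sketch assumes each site's data is simply a list of partial observations; the tiny data set and the run totals are invented for illustration.

```python
from collections import Counter
from math import prod
from statistics import mean, stdev


def count_raw_matches(sites):
    """Count n-tuples (one observation per site) whose values all agree."""
    counters = [Counter(observations) for observations in sites]
    common = set(counters[0])
    for counter in counters[1:]:
        common &= set(counter)
    # Each shared value contributes the product of its counts per site.
    return sum(prod(counter[value] for counter in counters) for value in common)


# Hypothetical data: "a" is seen twice at sites one and three, once at site two.
toy_sites = [["a", "a", "b"], ["a", "c"], ["a", "a", "b"]]
toy_raw = count_raw_matches(toy_sites)

# Mean and sample standard deviation over some made-up run totals.
run_totals = [4, 6, 5]
print(toy_raw, mean(run_totals), stdev(run_totals))
```

Because every combination of repeated values counts as a separate n-tuple, the raw figure grows multiplicatively with repeats, which is why it can exceed the number of vehicles at any one site.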

The first five rows are all results on just two test sites. This
procedure is not the ideal one to use for estimates on matches
between just two sites, and the work of other authors in the field
should be used in such a circumstance. However, these results are
included here for completeness. In the first experiment, the average
number of raw matches over the twenty runs is 111.4. The average
number of corrected matches is 100 less than this (11.4). This is
close to the correct answer of 10. However, it should be noticed that
the σ is high in comparison to the actual answer. In this case, the σ
is 8.5, which is of the same order of magnitude as the answer. This
is to be expected since we are looking for only 10 true matches in
over 110 observed matches. If we increase the number of vehicles to
2000 then, as would be expected, the number of false matches goes up
(to approximately 400) and the σ also rises (to almost 20).
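For just two sites the correction can be written in closed form, which makes the figures above easy to check. The sketch below is not the general algorithm; it assumes that each of the #S_1 #S_2 - X pairs of distinct vehicles matches by chance with probability p(2), so that the expected raw count is X + (#S_1 #S_2 - X) p(2), and solves for X. The function name is hypothetical.

```python
def correct_two_sites(raw_matches, n1, n2, p2):
    """Estimate the true matches X between two sites.

    Assumes E[raw] = X + (n1 * n2 - X) * p2 and solves for X.
    """
    return (raw_matches - n1 * n2 * p2) / (1.0 - p2)


# Flat distribution over 10,000 types gives p(2) = 1e-4.
p2 = 1e-4
first_row = correct_two_sites(111.4, 1000, 1000, p2)
fifth_row = correct_two_sites(596.6, 1000, 1000, p2)
print(round(first_row, 1), round(fifth_row, 1))
```

Applied to the mean raw counts of the first and fifth two-site rows, this gives about 11.4 and 496.6, in line with the tabulated corrected means of 11.4 and 496.7 once rounding of the raw means is allowed for.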

The next five rows of results are all over three sites. In the first
of these, 10 vehicles travel between all three sites and all other
matches are coincidence. 1000 vehicles are observed at each site. The
mean corrected number of matches across all sites (9.3) is close to
the actual answer of 10 and the σ is lower than in the two site case.
However, when the same experiment is run with 500 vehicles travelling
from site one to site two in addition to the 10 vehicles travelling
through all three sites, the σ increases markedly (it almost
doubles). In all cases with three sites, the mean is a good estimate
and the σ is generally low enough that a good estimate can be
expected.

The next four rows of results are for experiments made over four
sites. The first experiment has 100 vehicles which visit all four.
The mean corrected number of matches is 104.0 (very close) and the σ
is only 22.6. It is hard to explain why this σ actually falls in the
next experiment, when more vehicles are genuinely seen in common
between the other sites. In all cases the mean of the predictions is
approximately correct (the worst performance being in the fourth
experiment, where the mean was 106.1 rather than 100).

The next six rows of results are experiments made over five sites.
Again, the mean corrected results are approximately correct. However,
in the worst case, the mean is 11 too high and the σ in the results
is 46.7, which is comparable to the level of the effect being
observed. In this case approximately 120 false matches are being
removed each time. However, previous experiments have been able to
correct for a greater proportion of false matches with a lower σ in
the result.

The final four rows of results are experiments over six sites. This
was the largest number of sites for which it was practical to do runs
of twenty or more simulations with the computer power available.
Again, the mean corrected estimate of matches was nearly correct in
all cases. The worst performance was an estimate of 92.2 (correct
result 100). The σ was, however, relatively high. This was a surprise
in some cases, particularly the first row of results, where the mean
number of raw matches was only 21.2. In many senses, the worst result
was the final one, where a σ of 55.0 was given on a corrected
prediction of only 101.3.

The time taken to do one run over six sites with one thousand pieces
of data on each site was thirty seconds on a Celeron 366 computer
running Debian Linux. Six sites with one thousand vehicles at each is
a reasonable size for a typical traffic survey. It is practical (if
time consuming) to do experiments on seven sites, even using such
comparatively obsolete equipment. However, eight sites or more is
probably too computationally expensive for the moment and this is a
limitation of the method outlined. The exact rate at which the
computational requirements increase with the number of sites is hard
to determine. It will relate to the Bell numbers, to the number of
observations at each site and to the number of pairs of observations
at each site pair.
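To give a feel for the growth involved, the Bell numbers can be computed with the Bell triangle. The Bell number B_n counts the partitions of an n-element set; the sketch assumes that types of match over n sites correspond to such partitions, as the equivalence class formalisation suggests. The function name is hypothetical.

```python
def bell_numbers(n_max):
    """Return [B_1, ..., B_n_max] computed with the Bell triangle."""
    bells = []
    row = [1]
    for _ in range(n_max):
        # The last entry of each row of the triangle is a Bell number.
        bells.append(row[-1])
        # The next row starts with that entry; each further entry adds
        # the entry above and to the left.
        next_row = [row[-1]]
        for value in row:
            next_row.append(next_row[-1] + value)
        row = next_row
    return bells


bell_up_to_eight = bell_numbers(8)
print(bell_up_to_eight)
```

The sequence runs 1, 2, 5, 15, 52, 203, 877, 4140, so moving from six sites to eight multiplies the number of partitions by roughly twenty, before the growth in observation pairs is taken into account.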

The results given here are certainly consistent 
with the idea that the method gives an unbiased es- 
timator for the true number of matches. In some ex- 
periments, there were problems with the standard 
deviation being higher than would be desirable in 
real cases. It is important to bear in mind that 
these were relatively extreme tests of the method 
since p(2) and p(3) were relatively low and the 
number of samples given were quite high. Often the 
method was attempting to predict only ten true 
matches in a number of observed matches which 
might be several hundred. 

To test the method more fully, four very extreme tests were carried
out. Each of these tests involved six sites, at each of which one
thousand vehicles were observed. Interacting flows were chosen to
cause a large number of false matches in a diversity of ways. Because
these experiments were chosen to cause a large number of false
matches, one thousand runs of each experiment were performed. The
averaged results are shown in Table 2.

In experiment one, five hundred vehicles travelled from site one to
site five and five hundred from site two to site six. The remaining
five hundred vehicles at each of sites one and six appeared nowhere
else. No vehicles made the complete journey. As can be seen, on
average over seven hundred false matches were seen and the standard
deviation between runs was extremely large. Nevertheless, the mean
corrected result was within twelve of the correct answer (zero). In
such extreme circumstances, a single experiment would be next to
useless, but the result is good evidence that the method is unbiased.

In experiment two, five hundred vehicles travelled from one to three.
Five hundred vehicles travelled from four to six. Five hundred
vehicles visited only odd numbered sites and five hundred vehicles
visited only even numbered sites. In this experiment the corrected
mean result was almost exact (within one) and the standard deviation
was much lower than in the other three experiments.

 Num.   1-2  1-3  1-4  1-5  1-6   Av. Raw    σ Raw Av. Cor.   σ Cor.
 Veh.                             Matches  Matches  Matches  Matches

 1000    10                         111.4      8.5     11.4      8.5
 2000    10                         411.8     19.5     11.8     19.5
 1000   100                         199.2     12.0     99.2     12.0
 1000   200                         302.3      7.7    202.3      7.7
 1000   500                         596.6     12.3    496.7     12.3
 1000         10                     21.9      4.6      9.3      3.3
 1000   500   10                     73.8      7.5     10.2      6.2
 1000   100  100                    152.1      8.5    101.9      7.5
 1000   500  250                    388.3     22.7    253.2     20.1
 1000        500                    667.2     24.9    506.0     22.3
 1000             100               154.6     26.6    104.0     22.6
 1000   100  100  100               164.4     11.4     97.7      9.3
  500   100  100  100               140.7     19.3    105.8     17.4
 1000   500  250  100               207.8     29.7    106.1     23.7
  500    10   10   10   10           14.2      2.2     10.5      1.8
 1000    10   10   10   10           17.4      4.1      9.4      2.8
  500    50   50   50   50           71.3     14.3     47.8     12.3
  500   100  100  100  100          151.9     26.9     92.0     22.3
 1000                  100          177.6     29.9    103.4     22.6
 1000   100  100  100  100          222.2     61.5    111.0     46.7
 1000                        10      21.2     13.4     12.3      9.9
  500                       100     152.6     45.5     92.2     37.3
 1000                       100     214.6     58.0    103.5     40.2
 1000   100  100  100  100  100     289.8     88.4    101.3     55.0

Table 1

Simulation results, all performed over twenty runs with 10,000 distinct vehicle types.

Experiment  Expected   Av. Raw    σ Raw  Av. Cor.   σ Cor.
Number        Answer   Matches  Matches   Matches  Matches

1                  0       739      305      11.9      196
2                  0       110     45.5    -0.950     27.1
3                250       836      287       249      205
4                500      1920      531       496      356

Table 2

Simulation results, all performed over one thousand runs with 10,000 distinct vehicle types.

In experiment three, two hundred and fifty ve- 
hicles travelled to all sites. Five hundred vehicles 
went from site one to three and five hundred from 
four to six. The remaining two hundred and fifty 
vehicles at each site visited only that single site. As 
can be seen, the corrected result is almost exactly 
correct although, again, the standard deviation is 
so high that a single reading would be worthless. 

In experiment four, five hundred vehicles visited every site. Two
hundred and fifty vehicles went from sites one to three. Two hundred
and fifty vehicles went from sites four to six. Two hundred and fifty
vehicles visited only sites one and two, two hundred and fifty
vehicles visited only sites three and four, and two hundred and fifty
vehicles visited only sites five and six. Again, the mean of all
results is very close (within four vehicles) but the standard
deviation is the highest yet seen. This is not surprising: the mean
number of raw tuples of matches was nearly 2000, twice the number of
vehicles at each site.

These four tests provide a convincing demon- 
stration that the method is, indeed, unbiased as 
was shown by theory. 



8. Conclusions 

This paper presented a framework for analysis of surveys where
matches are required over more than two data collection points. The
framework given formalises the concept of a type of match using the
concept of the equivalence class. Further, a method is given for
evaluating $\mathcal{M}_n$, the set of all possible types of match
over multiple data sets.

The framework given is then applied to the problem of false matches,
which is put into the language of set theory using the concept of a
partial ordering. It is shown how this partial ordering can be used
to visualise, by means of a Hasse diagram, the ways in which false
matches can occur in data observed at multiple sites. The framework
was then used to design and implement an algorithm which was used to
estimate the number of true matches in simulated data. The algorithm
has also been tested on real data from partial plate surveys.

The results on simulated data show that the estimator appears to be
unbiased and that, in the majority of cases tested, the standard
deviation of the results is low. The method is suitable for analysis
of matches on data between three and seven test sites but becomes too
computationally intensive after this point. A significant improvement
to the method would be the estimation of a variance as well as a
corrected number of matches. A potential weakness of the method is
that it relies on good estimates for p(i).